NOTE: The dataset, visualizations, and results in this presentation are not representative of any type of business, user, or review on Yelp.

1. Yelp Academic Dataset #1 (Small)


1.1. Initial Questions

  • Initial questions before looking at the Yelp data
    • Who is the average Yelp user?
      • age group, gender distribution, average number of check-ins, ...?
    • What types of restaurants are rated higher?
      • vegetarian restaurant, pub, steak restaurant, family restaurant, etc.?
    • Are New Yorkers more likely to give higher review scores than people in other states?
    • Do female users tend to give higher review scores and/or write more reviews than male users?
    • Are there more restaurants with high review scores in New York than in other states?




1.2. Simple exploratory analysis of the Yelp dataset:

  • 1 table
    • Business (24 variables, 474,434 observations, 289.4 MB)
  • Questions with the data
    • Average review ratings by state
    • More detailed review ratings by state


ER Diagram for Yelp Small Dataset


  • Columns of interest
    • stars (1.0, 1.5, 2.0, ~ 4.5, 5.0)
    • review_count
    • state (16 states)
    • latitude/longitude




1.3. Average review ratings by state.


a. Pulling the data

library(dplyr)

# Drop rows with no state, then group by state and star rating
dataGroupByStateStar <- ylpDataSmall %>%
  filter(state != '') %>% mutate(tsum = n()) %>%
  group_by(state, stars)

# Regroup by state: business count, total reviews, mean rating
dataForTableByStateStar <- dataGroupByStateStar %>% group_by(state) %>%
  summarise(total_business = n(), total_reviews = sum(review_count),
            avg_rating = round(mean(stars), 2))



b. Displaying the data in a table

library(pander)
panderOptions("digits", 3)
pander(dataForTableByStateStar)

state  total_business  total_reviews  avg_rating
-----  --------------  -------------  ----------
CA               4000         141119        3.69
GA                500          15455        3.58
IL                500           7188        3.51
IN                393           3041        3.71
MA               1298          54477        3.6
MD                500           5813        3.36
MI                500          11634        3.66
NC                500           8147        3.84
NJ                500           7904        3.37
NY               1382          23675        3.49
ON                228           1014        3.69
PA               1000          20078        3.55
RI                500          11086        3.64
TX               1000          23935        3.72
VA                189           1503        3.56
WA                500          17998        3.64


c. Plotting the data on a Leaflet map

library(leaflet)
# One circle per state: radius and colour both encode the average rating
leaflet(dataTotalAvgStarByState) %>%
  addTiles() %>%
  setView(lng = -96.503906, lat = 38.68551, zoom = 4) %>%
  addCircles(lng = ~city_lng, lat = ~city_lat, weight = 0,
             radius = ~exp(totAvgRatingByState * 1.4) * 800, fillOpacity = 0.5,
             color = ~myCol(totAvgRatingByState), popup = ~totAvgRatingByState) %>%
  addLegend("bottomleft", pal = myCol, values = ~sort(totAvgRatingByState),
            title = "Avg.Ratings", labFormat = labelFormat(prefix = ""), opacity = 0.5)
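
The map call above relies on two objects that the slides do not define: the data frame dataTotalAvgStarByState and the palette function myCol. A minimal sketch of how they could be built, assuming a numeric palette from leaflet's colorNumeric() and one representative coordinate per state. The three rows, the city coordinates, and the colour ramp are illustrative assumptions, not the inputs actually used for the slide:

```r
library(leaflet)

# Hypothetical input: one row per state with a representative point.
# Ratings are copied from the table above; the coordinates (Los Angeles,
# New York City, Austin) are placeholders chosen for illustration only.
dataTotalAvgStarByState <- data.frame(
  state = c("CA", "NY", "TX"),
  city_lat = c(34.05, 40.71, 30.27),
  city_lng = c(-118.24, -74.01, -97.74),
  totAvgRatingByState = c(3.69, 3.49, 3.72)
)

# Assumed palette: maps the observed rating range onto a blue-to-red ramp.
myCol <- colorNumeric(
  palette = c("darkblue", "red"),
  domain = dataTotalAvgStarByState$totAvgRatingByState
)

# Each rating now resolves to a hex colour usable by addCircles()/addLegend().
stateColours <- myCol(dataTotalAvgStarByState$totAvgRatingByState)
```

With these three ratings, the radius expression exp(rating * 1.4) * 800 yields circles of roughly 100 to 150 km, which is why small rating differences remain visible at zoom level 4.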



1.4. A grid of detailed average ratings by state.


a. Pulling the data

# dataGroupByStateStar is grouped by (state, stars); summarise() drops the
# stars level, so the later mutate() calls work within each state
dataWeightedGroupByStateStar <- dataGroupByStateStar %>%
  summarise(totalByStar = n()) %>% arrange(desc(stars)) %>%
  mutate(total = sum(totalByStar), percent = round((totalByStar / total) * 100, 1)) %>%
  # custom column that weights the percent for bubble size on the plot
  mutate(percentWeight = case_when(percent >= 20 ~ percent * 2.5,
                                   percent >= 15 ~ percent * 1.2,
                                   percent >= 10 ~ percent,
                                   percent >= 5  ~ percent * 0.8,
                                   TRUE          ~ 1))
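
To see what the tiered weighting does to bubble sizes, the rule can be applied to a few hand-picked percentages. This is a small base R sketch; the helper name percent_weight and the sample values are made up for illustration:

```r
# Hypothetical helper reproducing the tier rule used for percentWeight.
percent_weight <- function(percent) {
  ifelse(percent >= 20, percent * 2.5,
  ifelse(percent >= 15, percent * 1.2,
  ifelse(percent >= 10, percent,
  ifelse(percent >= 5,  percent * 0.8, 1))))
}

# Made-up percentages, one per tier, plus a value just under the top tier.
samplePercent <- c(25, 19.9, 17, 12, 6, 2)
sampleWeight <- percent_weight(samplePercent)

data.frame(percent = samplePercent, percentWeight = sampleWeight)
```

The weight is not continuous: 19.9% maps to about 23.9 while 20% maps to 50, so bubbles at or above 20% are deliberately exaggerated on the plot.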


b. Plotting the data as a ggplot2 bubble plot

library(ggplot2)
# Bubble size follows percentWeight; the label shows the raw percent.
# alpha sits inside aes(), so ggplot2 treats it as a mapped constant.
ggplot(dataWeightedGroupByStateStar, aes(x = state, y = stars, label = percent)) +
  geom_point(aes(size = percentWeight * 2, colour = stars, alpha = 0.05)) +
  geom_text(hjust = 0.4, size = 4) +
  scale_size(range = c(1, 30), guide = "none") +
  scale_color_gradient(low = "darkblue", high = "red") +
  labs(title = "A grid of detailed avg.ratings by state", x = "State", y = "Detailed Avg.Ratings") +
  scale_y_continuous(breaks = seq(1, 5, 0.5)) +
  theme(legend.title = element_blank())



2. Yelp Academic Dataset #2 (Big)


2.1. Initial questions

  • Initial questions
    • Users with more followers on Yelp –> purchase more (or check in more frequently)?
    • Female users –> more frequent reviews?
    • Restaurants in the higher price range –> higher review ratings?
  • The source link is no longer available.



2.2. Simple exploratory analysis:

  • 5 tables
    • Business (98 variables, 77,445 observations, 30.4MB)
    • User (23 variables, 552,339 observations, 135.9MB)
    • Reviews (10 variables, 2,225,213 observations, 1.64GB)
    • Checkin (170 variables, 55,569 observations, 13.2MB)
    • Tip (6 variables, 591,864 observations, 74.5MB)


ER Diagram for Yelp Big Dataset

  • Columns of interest (in User table)
    • average_stars (1.0, 1.5, 2.0, ~ 4.5, 5.0)
    • elite
    • fans


  • My questions with the data
    • Having more followers at Yelp –> Rate more frequently?
    • Having more followers at Yelp –> Rate higher?


  • Elite user group (reference):
    • Active reviewers elected by Yelp’s National Elite Squad Council
    • Selection criteria unknown
      • Examples: well-written reviews, high quality tips, a detailed personal profile, an active voting and complimenting record, and a history of playing well with others.
    • Invited to private events where up-and-coming restaurants and bars provide food and drinks for free



2.3. (Q1) Having more followers at Yelp –> Rate more frequently?


a. Pulling the data

# Split users by Elite status; users with no Elite years have elite == "[]"
ylpUserSmElite <- ylpUserSm3 %>% filter(elite != "[]")
ylpUserSmNormal <- ylpUserSm3 %>% filter(elite == "[]")
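
The box plots that follow group users by a Fan_Size column whose construction is not shown. One way such a column could be derived is by binning fans with cut(). The bin edges, labels, and toy rows below are assumptions for illustration, not the values used in the presentation:

```r
# Toy stand-in for the User table (made-up rows).
toyUsers <- data.frame(
  fans = c(0, 3, 50, 500),
  review_count = c(2, 15, 120, 900),
  elite = c("[]", "[2012, 2013]", "[]", "[2015]"),
  stringsAsFactors = FALSE
)

# Assumed bins: 0, 1-10, 11-100, and more than 100 fans.
toyUsers$Fan_Size <- cut(
  toyUsers$fans,
  breaks = c(-Inf, 0, 10, 100, Inf),
  labels = c("0", "1-10", "11-100", "100+")
)

# Same Elite / non-Elite split as above, written in base R.
toyElite <- toyUsers[toyUsers$elite != "[]", ]
toyNormal <- toyUsers[toyUsers$elite == "[]", ]
```

Because Fan_Size is a factor, passing it to both group and color gives one box per bin, which matches how the plots below use it.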



b. Plotting the data as box plots

* All users:

library(ggthemes)
# Yelp users in the boxplot
qplot(fans, review_count, data = ylpUserSm3, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Total review counts by the number of fans") +
  theme(legend.position = "none")



* Elite users:

# Elite Yelp group users in the boxplot
qplot(fans, review_count, data = ylpUserSmElite, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Total review counts by the number of fans (Elite users)") +
  theme(legend.position = "none")



* Non-elite users:

# Non-elite Yelp group users in the boxplot
qplot(fans, review_count, data = ylpUserSmNormal, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Total review counts by the number of fans (Non-elite users)") +
  theme(legend.position = "none")



c. Plotting the data as combination plots (point + smooth)

* All users:

# Yelp users in combination plots
qplot(fans, review_count, data = ylpUserSm1, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Total review counts by the number of fans") +
  scale_color_gradient(low = "darkblue", high = "darkred") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



* Elite users:

# Elite Yelp group users in combination plots
qplot(fans, review_count, data = ylpUserSmElite, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Total review counts by the number of fans (Elite users)") +
  scale_color_gradient(low = "darkblue", high = "darkred") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



* Non-elite users:

# Non-elite Yelp group users in combination plots
qplot(fans, review_count, data = ylpUserSmNormal, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Total review counts by the number of fans (Non-elite users)") +
  scale_color_gradient(low = "darkblue", high = "darkred") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



2.4. (Q2) Having more followers at Yelp –> Rate higher?


a. Plotting the data as box plots

* All users:

# Yelp users in the boxplot
qplot(fans, average_stars, data = ylpUserSm3, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Average ratings by the number of fans") +
  theme(legend.position = "none")



* Elite users:

# Elite Yelp group users in the boxplot
qplot(fans, average_stars, data = ylpUserSmElite, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Average ratings by the number of fans (Elite users)") +
  theme(legend.position = "none")



* Non-elite users:

# Non-elite Yelp group users in the boxplot
qplot(fans, average_stars, data = ylpUserSmNormal, geom = "boxplot",
      group = Fan_Size, color = Fan_Size) +
  labs(title = "Average ratings by the number of fans (Non-elite users)") +
  theme(legend.position = "none")



b. Plotting the data as combination plots (point + smooth)

* All users:

# Yelp users in combination plots
qplot(fans, average_stars, data = ylpUserSm1, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Average ratings by the number of fans") +
  scale_color_gradient(low = "darkblue", high = "darkred") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



* Elite users:

# Elite Yelp group users in combination plots
qplot(fans, average_stars, data = ylpUserSmElite, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Average ratings by the number of fans (Elite users)") +
  scale_color_gradient(low = "darkblue", high = "red") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



* Non-elite users:

# Non-elite Yelp group users in combination plots
qplot(fans, average_stars, data = ylpUserSmNormal, geom = c("point", "smooth"), colour = fans) +
  labs(title = "Average ratings by the number of fans (Non-elite users)") +
  scale_color_gradient(low = "darkblue", high = "red") +
  stat_smooth(fill = "green", colour = "cyan", size = 1, alpha = 0.1)



3. Conclusions


3.1. Findings


  • The number of followers on Yelp and rating frequency
    • Yelp users tend to rate more frequently as their number of followers increases, up to a point
  • The number of followers on Yelp and average rating
    • Yelp users, specifically Elite users, tend to rate higher as their number of followers increases



3.2. Next steps

  • Interactive Shiny app
  • Statistical tests for the significance of the results
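
As a starting point for the significance testing listed above, a rank correlation test is one option, since count data such as fans and review_count are often heavily skewed. The sketch below uses simulated data rather than the Yelp tables, and the choice of Spearman's test is an assumption, not something the analysis has settled on:

```r
set.seed(42)

# Simulated users (not Yelp data): review counts loosely increase with fans.
simFans <- rpois(200, lambda = 5)
simReviews <- rpois(200, lambda = 20 + 3 * simFans)

# Spearman's rank correlation; exact = FALSE because counts contain ties.
fanReviewTest <- cor.test(simFans, simReviews, method = "spearman", exact = FALSE)

fanReviewTest$estimate  # rank correlation coefficient (rho)
fanReviewTest$p.value   # p-value for the null hypothesis of no association
```

With the real data, the same call would take ylpUserSm3$fans and ylpUserSm3$review_count for Q1, or average_stars in place of review_count for Q2.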